49 research outputs found

    Enforcing Predictability of Many-cores with DCFNoC

    © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    The ever-growing need for higher performance forces industry to include technology based on multiprocessor systems-on-chip (MPSoCs) in their safety-critical embedded systems. MPSoCs include a network-on-chip (NoC) to interconnect the cores with each other, with memory, and with the rest of the shared resources. Unfortunately, the inclusion of NoCs compromises time predictability guarantees, as network-level conflicts may occur. To overcome this problem, in this paper we propose DCFNoC, a new time-predictable NoC design paradigm where conflicts within the network are eliminated by design. This new paradigm builds on top of the Channel Dependency Graph (CDG) in order to deterministically avoid network conflicts. The network guarantees predictability to applications and is able to naturally inject messages using a TDM period equal to the optimal theoretical bound without the need for a computationally demanding offline process. DCFNoC is integrated in a tile-based many-core system and adapted to its memory hierarchy. Our results show that DCFNoC guarantees time predictability, avoiding network interference among multiple running applications. DCFNoC always guarantees performance and also improves wormhole performance in a 4 × 4 setting by a factor of 3.7× when interference traffic is injected. For an 8 × 8 network, differences are even larger. In addition, DCFNoC obtains a total area saving of 10.79% over a standard wormhole implementation.
    This work has been supported by MINECO under Grant BES-2016-076885, by MINECO and funds from the European ERDF under Grant TIN2015-66972-C05-1-R and Grant RTI2018-098156-B-C51, and by the EC H2020 RECIPE project under Grant 801137.
    Picornell-Sanjuan, T.; Flich Cardo, J.; Hernández Luz, C.; Duato Marín, JF. (2021). Enforcing Predictability of Many-cores with DCFNoC. IEEE Transactions on Computers. 70(2):270-283. https://doi.org/10.1109/TC.2020.2987797

    Efficient and scalable starvation prevention mechanism for token coherence

    Token Coherence is a cache coherence protocol that simultaneously captures the best attributes of the traditional approaches to coherence: direct communication between processors (like snooping-based protocols) and no reliance on bus-like interconnects (like directory-based protocols). This is possible thanks to a class of unordered requests that usually succeed in resolving cache misses. The problem with unordered requests is that they can cause protocol races, which prevent some misses from being resolved. To eliminate races and ensure the completion of the unresolved misses, Token Coherence uses a starvation prevention mechanism named persistent requests. This mechanism is extremely inefficient and, moreover, endangers the scalability of Token Coherence, since it requires storage structures (at each node) whose size grows proportionally to the system size. As multiprocessors continue to include an increasing number of nodes, both the performance and scalability of cache coherence protocols will continue to be key aspects. In this work, we propose an alternative starvation prevention mechanism, named priority requests, that outperforms the persistent request mechanism. It reduces application runtime by more than 20 percent (on average) in a 64-processor system. Furthermore, thanks to the flexibility shown by priority requests, it is possible to drastically reduce their storage requirements, thereby improving the overall scalability of Token Coherence. Although this is achieved at the expense of a slight performance degradation, priority requests still outperform persistent requests significantly.
    This work was partially supported by the Spanish MEC and MICINN, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01. Antonio Robles is taking a sabbatical granted by the Universidad Politécnica de Valencia for updating his teaching and research activities.
    Cuesta Sáez, BA.; Robles Martínez, A.; Duato Marín, JF. (2011). Efficient and scalable starvation prevention mechanism for token coherence. IEEE Transactions on Parallel and Distributed Systems. 22(10):1610-1623. doi:10.1109/TPDS.2011.30

    HP-DCFNoC: High Performance Distributed Dynamic TDM Scheduler Based on DCFNoC Theory

    (c) 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
    The need for increasing the performance of critical real-time embedded systems pushes the industry to adopt complex multi-core processor designs with embedded networks-on-chip. In this paper we present hp-DCFNoC, a distributed dynamic scheduler design that, by relying on the key properties of a delayed conflict-free NoC (DCFNoC), is able to achieve peak performance numbers very close to those of a wormhole-based NoC design without compromising its real-time guarantees. In particular, our results show that the proposed scheduler achieves an overall throughput improvement of 6.9x and 14.4x over a baseline DCFNoC for 16-node and 64-node meshes, respectively. When compared against a standard wormhole router, 95% of its network throughput is preserved while strict timing predictability is kept. This achievement opens the door to new high-performance time-predictable NoC designs.
    This work was supported in part by the Secretaría de Estado de Investigación, Desarrollo e Innovación (MINECO) under Grant BES-2016-076885, in part by the European Regional Development Fund (ERDF) under Grant TIN2015-66972-C05-1-R and Grant RTI2018-098156-B-C51, and in part by the EC H2020 European Institute of Innovation and Technology (SELENE) Project under Grant 871467.
    Picornell-Sanjuan, T.; Flich Cardo, J.; Duato Marín, JF.; Hernández Luz, C. (2020). HP-DCFNoC: High Performance Distributed Dynamic TDM Scheduler Based on DCFNoC Theory. IEEE Access. 8:194836-194849. https://doi.org/10.1109/ACCESS.2020.3033853

    Accurately modeling the on-chip and off-chip GPU memory subsystem

    Research on GPU architecture is becoming pervasive in both academia and industry because these architectures offer much more performance per watt than typical CPU architectures. This is the main reason why massive deployment of GPU multiprocessors is considered one of the most feasible solutions to attain exascale computing capabilities. The memory hierarchy of the GPU is a critical research topic, since its design goals differ widely from those of conventional CPU memory hierarchies. Researchers typically use detailed microarchitectural simulators to explore novel designs to better support GPGPU computing as well as to improve the performance of GPU and CPU-GPU systems. In this context, the memory hierarchy is a critical and continuously evolving subsystem. Unfortunately, the fast evolution of current memory subsystems deteriorates the accuracy of existing state-of-the-art simulators. This paper focuses on accurately modeling the entire (both on-chip and off-chip) GPU memory subsystem. For this purpose, we identify four main memory-related components that impact the overall performance accuracy. Three of them belong to the on-chip memory hierarchy: (i) memory request coalescing mechanisms, (ii) miss status holding registers, and (iii) cache coherence protocol; the fourth component refers to the memory controller and GDDR memory working activity. To evaluate and quantify our claims, we accurately modeled the aforementioned memory components in an extended version of the state-of-the-art Multi2Sim heterogeneous CPU-GPU processor simulator. Experimental results show important deviations, which can vary the final system performance provided by the simulation framework by up to a factor of three. The proposed GPU model has been compared and validated against the original framework and the results from a real AMD Southern-Islands 7870HD GPU.
    (C) 2017 Elsevier B.V. All rights reserved.
    This work was supported in part by Generalitat Valenciana under grant AICO/2016/059, by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds under Grant TIN2015-66972-C5-1-R, and by Programa de Ayudas de Investigación y Desarrollo (PAID) de la Universitat Politècnica de València.
    Candel-Margaix, F.; Petit Martí, SV.; Sahuquillo Borrás, J.; Duato Marín, JF. (2018). Accurately modeling the on-chip and off-chip GPU memory subsystem. Future Generation Computer Systems. 82:510-519. https://doi.org/10.1016/j.future.2017.02.012

    L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors

    © 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    Improving the utilization of shared resources is a key issue to increase performance in SMT processors. Recent work has focused on resource sharing policies to enhance the processor performance, but their proposals mainly concentrate on novel hardware mechanisms that adapt to the dynamic resource requirements of the running threads. This work addresses the L1 cache bandwidth problem in SMT processors experimentally on real hardware. Unlike previous work, this paper concentrates on thread allocation, by selecting the proper pair of co-runners to be launched to the same core. The relation between the L1 bandwidth requirements of each benchmark and its performance (IPC) is analyzed. We found that for individual benchmarks, performance is strongly connected to L1 bandwidth consumption, and this observation remains valid when several co-runners are launched to the same SMT core. Based on these findings we propose two L1 bandwidth aware thread to core (t2c) allocation policies, namely Static and Dynamic t2c allocation. The aim of these policies is to properly balance the L1 bandwidth requirements of the running threads among the processor cores. Experiments on a Xeon E5645 processor show that the proposed policies significantly improve the performance of the Linux OS kernel regardless of the number of cores considered.
    This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01; and by Programa de Apoyo a la Investigación y Desarrollo (PAID-05-12) of the Universitat Politècnica de València under Grant SP20120748.
    Feliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2013). L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors. IEEE. https://doi.org/10.1109/PACT.2013.6618810

    Cache-Hierarchy contention-aware scheduling in CMPs

    © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    To improve chip multiprocessor (CMP) performance, recent research has focused on scheduling strategies to mitigate main memory bandwidth contention. Nowadays, commercial CMPs implement multilevel cache hierarchies that are shared by several multithreaded cores. In this microprocessor design, contention points may appear along the whole memory hierarchy. Moreover, this problem is expected to worsen in future technologies, since the number of cores and hardware threads, and consequently the size of the shared caches, increase with each microprocessor generation. This paper characterizes the impact on performance of the different contention points that appear along the memory subsystem. The analysis shows that some benchmarks are more sensitive to contention in higher levels of the memory hierarchy (e.g., shared L2) than to main memory contention. In this paper, we propose two generic scheduling strategies for CMPs. The first strategy takes into account the available bandwidth at each level of the cache hierarchy. The strategy selects the processes to be coscheduled and allocates them to cores to minimize contention effects. The second strategy also considers the performance degradation each process suffers due to contention-aware scheduling. Both proposals have been implemented and evaluated in a commercial single-threaded quad-core processor with a relatively small two-level cache hierarchy. The proposals reach, on average, a performance improvement of 5.38 and 6.64 percent when compared with the Linux scheduler, while this improvement is 3.61 percent for a state-of-the-art memory contention-aware scheduler under the evaluated mixes.
    This work was supported by the Spanish MINECO under Grant TIN2012-38341-C04-01, and by the Universitat Politècnica de València under Grant PAID-05-12 SP20120748.
    Feliu Pérez, J.; Petit Martí, SV.; Sahuquillo Borrás, J.; Duato Marín, JF. (2014). Cache-Hierarchy contention-aware scheduling in CMPs. IEEE Transactions on Parallel and Distributed Systems. 25(3):581-590. https://doi.org/10.1109/TPDS.2013.61

    Exploiting Reuse Information to Reduce Refresh Energy in On-Chip eDRAM Caches

    © Owner/Author 2013. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ICS '13 Proceedings of the 27th international ACM conference on International conference on supercomputing; http://dx.doi.org/10.1145/2464996.2467278.
    This work introduces a novel refresh mechanism that leverages reuse information to decide which blocks should be refreshed in an energy-aware eDRAM last-level cache. Experimental results show that, compared to a conventional eDRAM cache, the energy-aware approach achieves refresh energy savings of up to 71%, while the reduction in overall dynamic energy is 65%, with negligible performance losses.
    This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under Grants TIN-2009-14475-C04-01 and TIN2012-38341-C04-01.
    Valero Bresó, A.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2013). Exploiting Reuse Information to Reduce Refresh Energy in On-Chip eDRAM Caches. ACM. https://doi.org/10.1145/2464996.2467278

    Addressing fairness in SMT multicores with a progress-aware scheduler

    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    Current SMT (simultaneous multithreading) processors co-schedule jobs on the same core, thus sharing core resources like L1 caches. In SMT multicores, threads also compete among themselves for uncore resources like the LLC (last level cache) and DRAM modules. Per-process performance degradation over isolated execution mainly depends on process resource requirements and the resource contention induced by co-runners. Consequently, the running processes progress at different paces. If schedulers are not progress aware, the unpredictable execution time caused by unfairness can introduce undesirable behaviors in the system, such as difficulties in maintaining priority-based scheduling. This work proposes a job scheduler for SMT multicores that provides fairness to the execution of multiprogrammed workloads. To this end, the scheduler estimates per-process standalone performance by periodically creating low-contention co-schedules. These estimates are used to compute the per-process progress. Then, those processes with less progress are prioritized to enhance fairness. Experimental results on an Intel Xeon with six dual-threaded SMT cores show that the proposed scheduler reduces unfairness, on average, by 3× over Linux OS. Moreover, thanks to the thread-to-core allocation policy, the scheduler slightly improves throughput and turnaround time.
    This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program Award.
    Feliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2015). Addressing fairness in SMT multicores with a progress-aware scheduler. IEEE. https://doi.org/10.1109/IPDPS.2015.48

    Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores

    Nowadays, high-performance multicore processors implement multithreading capabilities. The processes running concurrently on these processors continuously compete for the shared resources, not only among cores, but also within the core. While resource sharing increases resource utilization, the interference among processes accessing the shared resources can strongly affect the performance of individual processes and its predictability. In this scenario, process scheduling plays a key role in dealing with performance and fairness. In this work we present a process scheduler for SMT multicores that simultaneously addresses both performance and fairness. This is a major design issue since scheduling for only one of the two targets tends to damage the other. To address performance, the scheduler tackles bandwidth contention at the L1 cache and main memory. To deal with fairness, the scheduler estimates the progress experienced by the processes, and gives priority to the processes with lower accumulated progress. Experimental results on an Intel Xeon E5645 featuring six dual-threaded SMT cores show that the proposed scheduler improves both performance and fairness over two state-of-the-art schedulers and the Linux OS scheduler. Compared to Linux, unfairness is halved while performance still improves by 5.6 percent.
    We thank the anonymous reviewers for their constructive and insightful feedback. This work was supported in part by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972-C5-1-R and TIN2014-62246EXP, and by the Intel Early Career Faculty Honor Program Award.
    Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2017). Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores. IEEE Transactions on Computers. 66(5):905-911. https://doi.org/10.1109/TC.2016.2620977

    Constructing virtual 5-dimensional tori out of lower-dimensional network cards

    In the Top500 and Graph500 lists of recent years, some of the most powerful systems implement a torus topology to interconnect the millions of computing nodes they include. Some of these torus networks have five or six dimensions, which implies an additional difficulty as the node degree increases. In previous works, we proposed and evaluated the nD Twin (nDT) torus topology to virtually increase the number of dimensions a torus is able to implement. We showed that this new topology reduces the distances between nodes, thereby increasing global network performance. In this work, we present how to build a 5DT torus network using a specific commercial 6-port network card (EXTOLL card) to interconnect those nodes. We show that, using the same number of cards, the performance of the 5DT torus network we are able to implement using our proposal is higher than the performance of the 3D torus network for the same number of compute nodes.
    Spanish MINECO; European Commission, Grant/Award Number: TIN2015-66972-C5-1-R and TIN2015-66972-C5-2-R; JCCM, Grant/Award Number: PEII-2014-028-P; Spanish MICINN, Grant/Award Number: FJCI-2015-26080.
    Andújar-Muñoz, FJ.; Villar, JA.; Sanchez Garcia, JL.; Alfaro Cortes, FJ.; Duato Marín, JF.; Fröning, H. (2017). Constructing virtual 5-dimensional tori out of lower-dimensional network cards. Concurrency and Computation Practice and Experience. 1-17. https://doi.org/10.1002/cpe.4361